Influential Features PCA for High Dimensional Clustering
Authors
Abstract
We consider a clustering problem where we observe feature vectors X_i ∈ R^p, i = 1, 2, . . . , n, from K possible classes. The class labels are unknown and the main interest is to estimate them. We are primarily interested in the modern regime of p ≫ n, where classical clustering methods face challenges. We propose Influential Features PCA (IF-PCA) as a new clustering procedure. In IF-PCA, we select a small fraction of features with the largest Kolmogorov-Smirnov (KS) scores, obtain the first (K − 1) left singular vectors of the post-selection normalized data matrix, and then estimate the labels by applying the classical k-means procedure to these singular vectors. The only tuning parameter in this procedure is the threshold in the feature-selection step. We set the threshold in a data-driven fashion by adapting the recent notion of Higher Criticism, so IF-PCA is a tuning-free clustering method. We apply IF-PCA to 10 gene microarray data sets, where it has competitive clustering performance; in particular, in three of the data sets, the error rates of IF-PCA are only 29% or less of those of the other methods. We have also rediscovered the empirical-null phenomenon of Efron (2004) on microarray data. With delicate analysis, especially post-selection eigen-analysis, we derive tight probability bounds on the Kolmogorov-Smirnov statistics and show that IF-PCA yields clustering consistency in a broad context. The clustering problem is connected to the problems of sparse PCA and low-rank matrix recovery, but it differs in important ways. We reveal an interesting phase-transition phenomenon associated with these problems and identify the range of interest for each.
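To make the procedure concrete, the following Python sketch walks through the four steps of the pipeline (KS scoring, Higher Criticism thresholding, post-selection SVD, k-means). It is an illustration under assumed conventions: the column standardization and the particular HC functional used here are standard textbook choices, not necessarily the paper's exact specification.

import numpy as np
from scipy.stats import kstest
from sklearn.cluster import KMeans

def if_pca(X, K):
    """Minimal sketch of the IF-PCA pipeline described in the abstract.

    X is an (n, p) data matrix with samples in rows; K is the number of
    classes. Normalization and the exact Higher Criticism functional
    are common conventions, assumed here for illustration.
    """
    n, p = X.shape
    # Column-wise standardization (assumed normalization step).
    W = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Step 1: KS p-value of each feature against an N(0, 1) null.
    pvals = np.array([kstest(W[:, j], 'norm').pvalue for j in range(p)])

    # Step 2: data-driven threshold via Higher Criticism -- sort the
    # p-values and maximize the standardized excess of small ones.
    order = np.argsort(pvals)
    ps = pvals[order]
    k = np.arange(1, p + 1)
    hc = np.sqrt(p) * (k / p - ps) / np.sqrt(ps * (1 - ps) + 1e-12)
    j_star = int(np.argmax(hc[: p // 2]))  # search the smaller p-values only
    selected = order[: j_star + 1]

    # Step 3: first (K - 1) left singular vectors of the post-selection matrix.
    U = np.linalg.svd(W[:, selected], full_matrices=False)[0]
    scores = U[:, : K - 1]

    # Step 4: classical k-means on the singular vectors gives the labels.
    return KMeans(n_clusters=K, n_init=10).fit_predict(scores)

Note that the only data-driven decision is j_star: everything after the HC maximization is parameter-free, which is what makes the method tuning-free.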
Similar resources
Discussion of Influential Features PCA for High Dimensional Clustering
We commend Jin and Wang on a very interesting paper introducing a novel approach to feature selection within clustering and a detailed analysis of its clustering performance under a Gaussian mixture model. I shall divide my discussion into several parts: (i) prior work on feature selection and clustering; (ii) theoretical aspects; (iii) practical aspects; and finally (iv) some questions and dir...
Full text
Discussion of “Influential Feature PCA for High Dimensional Clustering”
We would like to congratulate the authors on an interesting paper and a novel proposal for clustering high-dimensional Gaussian mixtures with a diagonal covariance matrix. The proposed two-stage procedure first selects features based on the Kolmogorov-Smirnov statistics and then applies a spectral clustering method to the post-selected data. A rigorous theoretical analysis for the clustering e...
Full text
High-Dimensional Unsupervised Active Learning Method
In this work, a hierarchical ensemble of projected clustering algorithms for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM), a fuzzy learning scheme inspired by some behavioral features of human brain functionality. The high-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...
Full text
Important Features PCA for high dimensional clustering
We consider a clustering problem where we observe feature vectors X_i ∈ R^p, i = 1, 2, . . . , n, from K possible classes. The class labels are unknown and the main interest is to estimate them. We are primarily interested in the modern regime of p ≫ n, where classical clustering methods face challenges. We propose Important Features PCA (IF-PCA) as a new clustering procedure. In IF-PCA, we select a ...
Full text
Phase Transitions for High Dimensional Clustering and Related Problems
Consider a two-class clustering problem where we observe X_i = ℓ_i μ + Z_i, with Z_i iid ∼ N(0, I_p), 1 ≤ i ≤ n. The feature vector μ ∈ R^p is unknown but is presumably sparse. The class labels ℓ_i ∈ {−1, 1} are also unknown and the main interest is to estimate them. We are interested in the statistical limits. In the two-dimensional phase space calibrating the rarity and strengths of useful features, we fin...
Full text
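For intuition about this two-class model, the short Python simulation below draws data from it with an arbitrary choice of (n, p, s, tau) — these values are illustrative, not from the paper — and clusters by the sign of the leading left singular vector. When the useful features are rarer or weaker, plain SVD breaks down and feature selection as in IF-PCA becomes essential, which is what the phase-space analysis quantifies.

import numpy as np

rng = np.random.default_rng(0)
n, p, s, tau = 200, 2000, 50, 1.0  # illustrative sizes, not values from the paper

# Sparse feature vector mu with s useful coordinates of strength tau.
mu = np.zeros(p)
mu[rng.choice(p, size=s, replace=False)] = tau

# Labels l_i in {-1, +1} and observations X_i = l_i * mu + Z_i, Z_i ~ N(0, I_p).
labels = rng.choice([-1, 1], size=n)
X = np.outer(labels, mu) + rng.standard_normal((n, p))

# Cluster by the sign of the leading left singular vector of X.
u = np.linalg.svd(X, full_matrices=False)[0][:, 0]
est = np.sign(u)
# Labels are identifiable only up to a global sign flip.
err = min(np.mean(est != labels), np.mean(est != -labels))
print(f"clustering error rate: {err:.3f}")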